Exploiting spatial sparsity for mult i- wavelength imaging in optical interferometry 
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Optical interferometers provide multiple wavelength measurements. In order 
to fully exploit the spectral and spatial resolution of these instruments, new al- 
gorithms for image reconstruction have to be developed. Early attempts to deal 
with multi-chromatic interfcrometric data have consisted in recovering a gray 
image of the object or independent monochromatic images in some spectral 
bandwidths. The main challenge is now to recover the full 3-D (spatio-spectral) 
brightness distribution of the astronomical target given all the available data. 
We describe a new approach to implement multi- wavelength image reconstruc- 
tion in the case where the observed scene is a collection of point-like sources. 
We show the gain in image quality (both spatially and spectrally) achieved by 
globally taking into account all the data instead of dealing with independent 
spectral slices. This is achieved thanks to a regularization which favors spa- 
tial sparsity and spectral grouping of the sources. Since the objective function 
is not differentiable, we had to develop a specialized optimization algorithm 
which also accounts for non-negativity of the brightness distribution. 
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1. Context 

The objective of stellar interfcrometric imaging is to re- 
cover an approximation of the specific brightness dis- 
tribution I\{9) of observed astronomical objects given 
measurements providing incomplete samples of the spa- 
tial Fourier transform of I\(9). Reconstruction of a 
monochromatic image from optical interferometry data is 
a challenging task which has been the subject of fruitful 
research and resulted in various algorithms (e.g., MiRA 
[1], Bsmem [2, 3], WlSARD [4], the building-block method 
[5]). When dealing with multi-spectral data, a first pos- 
sibility is to process each wavelength independently and 
reconstruct a monochromatic image for each subset of 
measurements from a given spectral channel. For in- 
stance, this is what have been done by le Bouquin et al. 
[6] for the multi-spectral images of the Mira star T Lep. 
Another possibility is to exploit some assumed spectral 
continuity of I\(0) and process the multi-spectral data 
globally to reconstruct an approximation of the 3-D dis- 
tribution I\(6). This computationally more challenging 
approach can potentially lead to better reconstructions. 
Significant improvements have been shown when follow- 
ing such spatio-spectral processing in the context of in- 
tegral field spectral spectroscopy [7-9]. This paper de- 
scribes a method to jointly reconstruct multi-spectral op- 
tical interfcrometric data. 

In order to simplify the problem, we restricted our 



study to the cases where the complex visibilities are 
observed and where the observed scene is a collection 
of point-like sources. This correspond, for instance, to 
the science case of the instrument Gravity which will 
be installed at the Very Large Telescope Interferometer 
(VLTI) to carry out astrometry with absolute phase ref- 
erence of stars in the galactic center or in globular clusters 
[10]. In some sense, this latter assumption makes our 
algorithm a successor of the Clean algorithm [11, 12] 
and the building-block method [5] developed for recov- 
ering monochromatic images from radio and optical in- 
terfcrometric data respectively. The Clean algorithm 
have been proposed for the processing of Gravity in- 
terferometric data [13] but only considering "gray" data 
and not more than three stars in the field of view. In 
addition to processing multi-variate data, we also intro- 
duce the explicit minimization of a non-differcntiable reg- 
ularization term so as to favor spatial sparsity of the 
reconstructed brightness distribution in a way which is 
known to be more efficient [14, 15] than greedy algo- 
rithms like Clean [16] or the building-block method [5]. 
The method presented in this paper improves over early 
developments presented as an invited paper at the 2012 
SPIE Conf. on Astronomical Telescopes & Instrumenta- 
tion in Amsterdam [17]. 

Our paper is organized as follows: we first summa- 
rize the inverse approach for image reconstruction from 
interfcrometric data and discuss various possibilities to 
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impose spatial sparsity, we then detail our algorithm 
for minimizing the objective function; finally we present 
some results on simulated data and discuss the advan- 
tages of our approach. 

2. Method 

A. General principle of image reconstruction 

Following an inverse approach, we state image recon- 
struction as a constrained optimization problem [18]: 

x + = argmin{/ data (cc) + (J,f pr j or (x)} (1) 

ccGX 

where x G K' 31 ' are the sought image parameters, |x| = 
Card(:r) is the number of parameters, X c K' 31 ' is the 
subset of feasible parameters, fd a ta( x ) is a data fitting 
term which enforces agreement of the model with the 
measurements y G R^l, / pr ior(aO is a regularization term 
and fi > is a so-called hyper-parameter used to tune 
the relative weight of the regularization. Following a 
Bayesian interpretation, /data is the opposite of the log- 
likelihood of the measurements, f pr \or is the opposite of 
the log of the prior distribution of the parameters, and 
x + defines the maximum a posteriori (MAP) estimate. 

Constraining the solution to belong to the feasible set 
X is a mean to impose strict constraints such as the non- 
negativity: 

X = {x G R n : x > 0} (2) 
where the inequality x > is to be taken element-wise. 

B. Direct model and likelihood 

In this study, we assume that the optical intcrferomet- 
ric data consist in Fourier transform of the brightness 
distribution measured for a finite set of spatial fre- 

quencies v = B/X with B the interferometric baseline 
(projected in a plane perpendicular to the line of sight) 
and A the wavelength [19]. 

The most practical representation of a multi-variate 
distribution such as I\(0) by a finite number of param- 
eters x consists in sampling I\(6) separately along its 
spatial and spectral dimensions. The image parameters 
are then: 

x„,£ ps I\ e (0 n ) (3) 

for A^ G W the list of sampled wavelengths and 9 n G A 
the list of angular directions, the so-called pixels. 

For the sake of notational simplicity, we use the same 
wavelengths in W as the ones of the data spectral chan- 
nels and we denote by yp,m,i the real (p — 1) or imagi- 
nary {p — 2) part of the complex visibility obtained with 
mth baseline in £th spectral channel. This notation is 
intended to clarify the equations and does not impose or 
assume that all baselines have been observed in all spec- 
tral channels. By considering that complex numbers are 
just pairs of real values, our notation also avoids dealing 



with complex arithmetic. In our framework, the model 
of the data is affinc: 

Vp,m,l = (H • x) p , m ,l + ep,m,e 

— J n 

where the term e accounts for noise and modeling ap- 
proximations. Formally, the coefficients of the operator 
H are given by [19]: 

f +cos(0l-B m /\ i ) forp = l 

tlp.m.n.t \ -T- (5) 

( -sin(6»,;-B m /A,) forp = 2 

with B m the mth observed baseline and 9^ B m the usual 
scalar product between B m and 9 n . 

At least because of the strict constraints imposed by 
the feasible set X, solving the image reconstruction prob- 
lem in Eq. (1) must be carried out by an iterative algo- 
rithm. Owing to the size of the problem, a fast version of 
H has to be implemented. First, we note that the model 
is separable along the spectral dimension (using a con- 
ventional matrix representation, H would have a block 
diagonal structure): 

y e = U e ■ x e + e e , (6) 

where the index I denotes the sub-vector or the sub- 
operator restricted to the coefficients corresponding to 
the £th spectral channel. With the generalization of 
multi-processor computers or multi-core processors, this 
property of the operator H may be easily exploited to 
parallelize the code to apply H (or its adjoint H T ) to 
a given argument. Second, an algorithm such as the 
nonuniform fast Fourier transform (NU-FFT) [20] can 
be implemented to speed up the computations by ap- 
proximating the operator by: 

H, ps R, F S (7) 

where F is the discrete Fourier transform (DFT), in- 
terpolates the discrete spatial frequencies resulting from 
the DFT at the frequencies observed in £th channel 
and S is a zero-padding and apodizing operator. Zero- 
padding improves the accuracy of the approximation, 
while apodization pre-compensates for the convolution 
by the interpolation kernel used in Rf [20]. Note that 
only the interpolation in the Fourier domain R^ depends 
on the spectral channel. In NU-FFT, S is diagonal, R^ 
is very sparse and F is implemented by a fast Fourier 
transform (FFT) algorithm, thus the approximation in 
Eq. (7) is very fast to compute. 

Assuming Gaussian noise distribution, the likelihood 
term writes: 

/ data (a0 = l - (H • x - y) T ■ W • (H • x - y) (8) 

where W G RH x lyl is a statistical weighting matrix; in 
principle, W is the inverse of the covariance matrix of 
the measurements: W = Cov(y) -1 . 
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C. Regularization based on spatial sparsity 

Due to the voids in the spatial frequencies covered by 
the observations, the constraints provided by the data 
alone do not suffice to define a unique image. The prior 
constraints imposed by / pr ior(^) are then required to help 
choosing a unique solution among all the images that are 
compatible with the measurements. 

In this paper, we focus on a particular type of astro- 
nomical targets which consist in a number of point-like 
sources with different spectral energy distributions. This 
includes the case of multiple stars, globular clusters, or 
groups of stars as observed in the center of our galaxy. 
For such objects, the most effective means to regularize 
the problem is to favor spatially sparse distributions, i.e. 
images with as few sources as possible on a dark back- 
ground. In this section, we derive expressions of the regu- 
larization term / pr i r(a3) suitable to favor spatially sparse 
distributions. 

1. Fully separable sparsity 

It is now well known that using the l\ norm as the regu- 
larization term is an effective mean to impose the sparsity 
of the solution while approximating the data [14]. This 
leads to take f pr \ r{x) = /sparse(^ ) with: 

/sparse^) = ||a;||l = £ Vk,l\ = Sgn(cc) T - X , (9) 

k,e 

where sgn(a;) is the sign function applied element-wise 
to the parameters x. When the parameters are non- 
negative, sgn(aj) = 1 with 1 = (1, . . . , 1) T . 

The regularization term f sp3 rse(x) in Eq. (9) is com- 
pletely separable. In our framework where the model 
is spectrally separable, the global criterion defined in 
Eq. (1) is therefore separable along the spectral dimen- 
sion. Provided data from different spectral channels are 
statistically independent, the image reconstruction can 
be solved independently for each spectral channel. 

If the wavelength samples \i of the discrete model x of 
I\ (9) do not coincide with the effective wavelengths of the 
data, spectral interpolation of the model is required to 
match the observed wavelengths. In this case, a certain 
spectral correlation is intrinsic to the model and the 3-D 
image reconstruction has to be performed globally even 
if the regularization does not impose any kind of spectral 
continuity. 

2. Non-separable spatial-only sparsity 

Physically, sources emit light at all wavelengths and we 
expect better image reconstruction if we can favor re- 
stored sources having the same position whatever the 
wavelength. Clearly, this is not achieved by the regular- 
ization /sparse (x) in Eq. (9) which is fully separable. In 
order to impose some spectral continuity while favoring 
spatial sparsity, we consider the following regularization 
instead [7, 15]: 

W*) = <,) V2 (io) 



with n the spatial index (pixel) and £ the spectral chan- 
nel. The fact that such a regularization favors spatial 
sparsity and spectral grouping is a consequence of the 
triangular inequality [15]. This sparsity prior is a spe- 
cial case of several recent generalizations such as group 
LASSO [21], mixed norms [22] or structured sparsity [23]. 

3. Explicit spectral continuity and gray object 

The penalization defined in Eq. (10) can be seen as a 
spatial regularization term which favors spectral group- 
ing but no spectral continuity nor spectral smoothness. 
Some authors [7, 8, 24] have shown the efficiency of ex- 
ploiting the spectral continuity of the sought distribution 
by using, in addition to a spatial regularization term, an 
additional spectral regularization like: 

/spectral {x) = £ \l n £(^,£+1 - X n j) 2 (11) 
n I 

with fi n > suitable regularization weights. In the limit 
fj,„ — > oo, Vn, the regularization in Eq. (If) amounts 
to assuming that the spectral energy distributions of all 
sources are flat, that is: 

x n ,i = g n , y(n,£) (12) 

where g is a gray image of the object. To speed up 
the reconstruction, only the gray image has to be re- 
constructed, using the model: 

yp,m,l = ^ Hp,m,n,£ 9n "t - Gp,m,^ 3 (1^) 
n 

and, to impose the spatial sparsity, the l\ regularization: 
/prior(g) = Hfflli = V \g n \ = sgn(g) T - g . (14) 
D. Optimization algorithm 

Most existing image reconstruction algorithms for optical 
interferometry (e.g., MiRA [I], Bsmem [2, 3] and WiSARD 
[4]) were designed for minimizing a smooth cost function. 
For that purpose, non-linear conjugate gradient method 
[25] or limited memory quasi-Newton methods such as 
VMLM-B [26] are quite efficient and easy to use as they 
only require computing the cost function and its gradi- 
ent. A notable exception is Macim[27] which is based 
on a Markov-Chain-Monte-Carlo (MCMC) optimization 
strategy suitable, in theory, for any type of criteria, in 
particular the non-smooth and non-convex ones; in prac- 
tice, this is however too computationally intensive for 
estimating a large number of parameters as it is the case 
for image reconstruction. When using non smooth regu- 
larizations as the ones in Eq. (9) and Eq. (10) to impose 
spatial sparsity, optimization algorithms based on New- 
ton method (that is, on a quadratic approximation of 
the cost function) are inefficient and completely differ- 
ent optimization strategies must be followed to solve the 
problem in Eq. (I) with non-differcntiablc cost functions. 
In our algorithm, we introduce variables splitting [28] to 
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handle the two terms of the cost function as indepen- 
dently as possible and we implement an alternating direc- 
tion method of multipliers [29] (ADMM) to solve the re- 
sulting constrained problem. The augmented Lagrangian 
with ADMM emerges as the most effective in the family 
of decomposition methods that includes proximal meth- 
ods [28], variable splitting with quadratic penalty [30], 
iterative Brcgman [31]. See [32] for detailed comparisons. 

1. Variable Splitting and ADMM 

Introducing auxiliary variables z, minimization of the 
two-term cost function in Eq. (1) can be recast in the 
equivalent constrained problem: 

min {/dataf^) + M /priori)} S.t. X = Z . (15) 

Imposing that x € X (i.e. x > 0) rather than z E X 
is not arbitrary and our motivation for that choice is 
explained in what follows. 

The augmented Lagrangian [25] is a very useful method 
to deal with constrained problems such as the one in 
Eq. (15). In our case, the augmented Lagrangian writes: 

C p (x,Z,u) = /data (z) + (J, /prior (x) 

+ u T - (x-z) + £\\x-z\\%, (16) 

with u the Lagrange multipliers associated with the con- 
straint x = z, p > the quadratic weight of the con- 
straints, and 1 1 1 1 2 the Euclidean (£2) norm of v. Note 
that taking p = yields the classical Lagrangian of the 
constrained problem. 

The alternating direction method of multipliers [29] 
(ADMM) consists in alternatively minimizing the aug- 
mented Lagrangian for x given z and m, then for z given 
x and u, and finally updating the multipliers u. This 
scheme, adapted to our specific problem in Eq. (15), is 
detailed by the following algorithm with the convention 
that v^' is the value of v at iteration number t: 

Algorithm 1. Resolution of problem (15) by alternat- 
ing direction method of multipliers. Choose initial vari- 
ables z^ and Lagrange multipliers u^°\ Then repeat, 
for t = 1, 2, . . . until convergence, the following steps: 

1. choose > and update variables x: 

a;W = argmin£ p(f ) (x, z ( - t ~ 1 \u ( - t ~ 1 ' ) ) 
ccex 

f z 9 '*' 11 ~(t) l|2l 

= argmin^ f pr i or (x) + — \ \x - x y \ (17) 

with: 

5 (*) =Z (*-D_ ti (*-l)/p(*) ; (18) 

2. update variables z: 

Z« = argmin£ p(t) (x (t \ z, u^) 

Z 

= argmin{/ da taO*) + ^ Ik-^lla} ( 19 ) 



with: 

z [t) =xW +u^/pW; (20) 
3. update multipliers u: 

„(*)=„(*-!) +p W p)- 2 W) . ■ (21) 

Our algorithm can be seen as an instance of SALSA 
[33] with however some improvements. First, we deal 
with the additional constraints that the variables are non- 
negative. Second, we allow for changing the weight of the 
augmented penalty at every iteration which can consider- 
ably speed-up convergence. Third, for real observations, 
the operator H cannot be easily diagonalized (e.g. by 
using FFT) thus the updating of variables z cannot be 
exactly carried out. Finally, we consider the possibility 
of warm starting the algorithm with a solution previously 
computed. This latter feature is of interest to improve a 
solution if too few iterations have been performed or to 
find the solution of the problem with a slightly different 
value of the hyper-parameter /i. 

In Appendix A, we show how the updating of the vari- 
ables x (step 1 of Algorithm 1) can be implemented 
taking into account the constraints that the parameters 
are non-negative. This is our motivation for imposing 
x 6 X on the variables x and not on the variables z. 
This avoids introducing additional auxiliary variables for 
the sole purpose of accounting for the feasible set. The 
formulae to update the variables x g X for f spa rse(x) and 
/joint(aO are respectively given by Eq. (A5) and Eq. (A13) 
in Appendix A. 

2. Solving for the auxiliary variables 

Since C p (x, z 1 u) is quadratic with respect to z, updating 
these variables (step 2 of Algorithm 1) amounts to solving 
the linear problem: 

A« ■ zW = b [t) (22) 

with: 

A (t) = H T - W ■ H + p w I (23) 
6 W =H T -W -y + p^xW+uV-V (24) 

with I the identity matrix (of suitable size). Since the 
augmented term is diagonal and provided data from dif- 
ferent spectral channels are statistically independent, this 
problem can be solved separately for each spectral chan- 
nel: 

Af-zf=bf, W, (25) 

with: 

Af = Hj W e ■ H e + pW I (26) 
bf = Hj WfV t + P (t) xf + u ( *~ 1} (27) 
using the same conventions as in Eq. (6). 
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In practice, we (approximately) solve these problems 
by means of the conjugate gradients algorithm [25, 34] 
and starting with the previous solution .z'* -1 ). Accord- 
ing to Eckstein-Bertsekas theorem [35], the ADMM al- 
gorithm is proved to converge provided that approxima- 
tions in the update of auxiliary variables z be absolutely 
summablc. That is: 

oo 

Ell* (t) -*exL|| 2 <^ (28) 
t=l 

must hold with z exact the exact solution of Eq. (19) and 
z^> the approximate solution of Eq. (19) returned by the 
conjugate gradient iterations. In Appendix B, we show 
how to set the stopping criterion of the conjugate gradi- 
ent method so that the constraint in Eq. (28) holds. In 
our tests, we however simply stop the conjugate gradi- 
ent iterations when the Euclidean norm of the residuals 
of Eq. (22) becomes significantly smaller than its initial 
value: 

|| A « • - 6 (t) || 2 < ecc ||A« • - b (t) || 2 (29) 

with ecG G (0,1)- For our tests, we took ecG = 10~ 2 
and allowed a maximum of 5 conjugate gradient itera- 
tions. With this simple prescription, we did not experi- 
ment any divergence of the global algorithm although it 
may depend on the problem at hands. 

3. Stopping criteria 

At the solution {x* , z* ,«*} of problem (15), Karush- 
Kuhn- Tucker (KKT) conditions of optimality [25] stipu- 
late that, the constraints must be satisfied and that the 
solution must be a stationary point of the Lagrangian: 

x*=z* (30) 
G d x Cv{x\z\u*) = tidf prior (x*)+u* (31) 
G d x £ (x*,z*,u*) = df data (z*) - u* (32) 

where d denotes the subdiffcrcntial operator [29]. Since 
/data is diffcrcntiablc, G and <9/d a ta can be replaced by 
= and by V/dataj the gradient of /d ata in the third KKT 
condition (32) which becomes: 

U* = V/data(**). (33) 

If z^' exactly minimizes £ p <t) (x^\ z, i/' -1 '), we have: 

V/data(2 (t) ) - + P (t) (Z^ - *<*>) = 

=> V/data(^ (i) )=« (t) , (34) 

thus the 3rd KKT condition in Eq. (33) is automatically 
satisfied at the end of an exact ADMM iteration. 

As we use the method of conjugate gradients to solve 
Eq. (22), Eq. (34) is only approximately satisfied. More- 
over, updating the multipliers u according to step 3 of the 
algorithm may be subject to accumulation of rounding 
errors. The stability of the algorithm or its convergence 
rate may be improved by taking u^' = V/ d ata(z^)- In 



our tests, tough, we have not seen significant differences 
between updating the Lagrange multipliers according to 
Eq. (21) or according to Eq. (34). 

Since x^> minimizes C p a) (x, z^ -1 ', u^ 1-1 '), we have: 

G pd.f p Ux {t) ) + « (t_1) + P W (* (t) - z (t ~ 1] ) 
G M 9/ prior ( a; W) + M W+pW 
=> - (*<*> - z^) G lidf^ix®) + «") 

thus: 

S W = p {t) ^(t) _ Z (t-D) (35) 

can be seen as the residuals for the 2nd KKT condition 
in Eq. (31), while: 

rW = x^ - z« (36) 

are the residuals for the primary constraint in Eq. (30). 

Finally, the KKT conditions imply that the so-called 
primal and dual residuals [29] defined in Eq. (36) and 
Eq. (35) must converge to zero. Following [29], we there- 
fore stop the algorithm when: 

\\r ( %<r$ m and ||.<*>|| a <rj£„ (37) 
where the convergence thresholds are given by: 

7p ( rim = ^ £ ^s + £rel max(| \ X « || 2 , ||* (t) || 2 ) , (38) 

r d ( * al =ViVeabs + e re i||M (t) || 2 , (39) 

where N = Card(a;) is the number of sought parameters, 
£abs > and e re \ G (0, 1) are absolute and relative conver- 
gence tolerances. For our tests, we found that e a b s = 
and e re i = 10 -3 yield sufficient precision for the solution. 

4. Initialization and warm start 

Given an initial estimate z^ for the auxiliary variables, 
the 3rd KKT condition in Eq. (33) suggests to start 
the iterative algorithm with initial Lagrange multipliers 
given by: 

« (0) = V/data(* (0) ) = H T W (H T - *(°) - y) . (40) 

Starting the algorithm as suggested has several advan- 
tages. First, there are no needs for the initial variables 
z' ' to belong to the feasible set. Second, this readily 
provides initial Lagrange multipliers. Note that starting 
with an initial estimate x 1 * ' for the variables x would 
have the double drawback that x^ must be feasible and 
that since /priori) may be non-diffcrcntiablc it does not 
yields an explicit expression for the initial Lagrange mul- 
tipliers. 

The other required initial setting is the value of the 
augmented penalty parameter p^> used to compute the 
first estimate x^ of the variables x given z^°> and u^°\ 
For the subsequent iterations, p can be kept constant 
or updated according to the prescription described in 
Sec. 2D 5. 
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To continue the iterations or compute a solution with 
slightly different parameters (e.g. the rcgularization pa- 
rameter p), the possibility to restart the algorithm with 
the output of a previous run with no loss of performances 
regarding the rate of convergence is a needed feature. 
This is called warm restart and is simply achieved by 
saving a minimal set of variables upon return of the al- 
gorithm. Since each iteration of our algorithm starts by 
computing the variables x given the auxiliary variables z, 
the Lagrange multipliers u and the augmented penalty 
parameter p, it is sufficient to save {z,u,p} for being 
able to warm restart the method. 

5. Tuning the augmented penalty parameter p 

One of the important settings of the ADMM method is 
the value of the augmented penalty parameter p: if it is 
too small, the primal constraints x = z will converge too 
slowly; while the cost functions will decrease too slowly 
if p is too large. Some authors, e.g. [33], use a constant 
augmented penalty parameter for all the iterations which 
requires trials and errors to find an efficient value for 
p. In fact, it is worth using a good value of p at every 
iteration of ADMM to accelerate the convergence [29] . In 
this section, we describe means to automatically derive 
a suitable value for the augmented penalty parameter 
following a simple reasoning. 

The convergence criterion defined in Eq. (37) is equiv- 
alent to have: 



(t) < 1 with 



-(t)\ 



r (t) 

prim 



T (t) 

'dual 



(41) 



According to the updating rules in one ADMM iteration 
(see Algorithm 1), <jP> does only depend on z'* -1 ), it^* -1 ' 
and p^\ The augmented penalty parameter is there- 
fore the only tunable parameter that has an incidence on 
the value of for the t th ADMM iteration. The idea is 
then to chose the value of p^' so as to approximately min- 
imize 4>^ . In terms of number of ADMM iterations, we 
expect to achieve the faster convergence of the algorithm 
in that way. However, tuning p at every ADMM iteration 
requires to repeat each iteration for different values of p 
and has therefore the same computational cost as sev- 
eral ADMM iterations. A compromise has to be found 
between the accuracy on p and the number of trials. 
Our objective is to derive an economical way to find: 



argmin0 t (p) 
p 



(42) 



where 4>t (p) is the value taken by (j)^ when = p. All 
the quantities, Hr^l^, ||s 



2> T P rim> an d T d V a |, involved in 



vary continuously (though not necessarily smoothly) 
with respect to p^; hence, considering the definition of 
p^p and , we obtain the following implication: 



It) 
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prim 
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(43) 



Besides, the norm of the primal residuals ||rO||2 is a 
decreasing function of p^; while the norm of the dual 
residuals ||s(*)||2 is an increasing function of p^ [29] and, 
close enough to the solution, the values of Tp r j m and Tduai 
should converge to their final values and thus not de- 
pend too much on p. Under these assumptions, the ratios 

l^/fprjm and 112/^dual should also be decreasing 
and increasing functions of p respectively. Close to the 
solution {x*, z* , u*} of the problem, the necessary condi- 
tion in Eq. (43) is therefore also a sufficient condition to 

define the optimal value p£\ These considerations lead 
us to choose pW such that: 



7««1 with «W = 



I 112 dual 
1 2 pnm 



.(*) 



(44) 



which is expected to be a decreasing function of p^ 1 close 
to the solution. A better alternative may be to choose 
p^i such that: 



(*) 

fait 



1 with 



ft) def 

fait = 



-(*) I 



r (t-l) 
12 dual 



I T (t ~ 1} 
1 2 prim 



(45) 



Indeed, as t^*^ 1 ^ and Tp r]n ^' do not depend on p^\ rj^ is 
always decreasing function of p^ while approaching 
when the algorithm is close to the solution. The following 
algorithm implements our safeguarded strategy to find 
pW > such that 77W w 1 or 77^ sa 1. 

Algorithm 2. Tuning of the augmented penalty pa- 
rameter p so that r\ f=a 1. Choose er>0, r > 1, 7 > 1 
and an initial value for p, and set p min = 0, p max = +00. 
Then, until convergence, repeat the following steps: 

1. Update x, z, and u according to ADMM updating 
rules. Compute 77, defined in Eq. (44) or in Eq. (45), 
and 4>, defined in Eq. (41). 



2. If 1/t < 7] < t or 4> < a i 
solution and stop. 



(t-i) 



, accept the current 



3. If i] < 1/t, then p is too large; let p ml 

J Pm\n Pmax if Pmin > 

P ' { Pmax/7 otherwise 



p and 



then go to step 1. 
4. If 77 > r, then p is too small; let p min := p and 

\J Pmin Pmax h Pmax ^ CXD 

7 Pmin otherwise 

then go to step 1. ■ 

The following remarks clarify some aspects of this al- 
gorithm: 
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• To simplify the notations, we dropped the index t 
of the ADMM iteration in the equations of Algo- 
rithm 2. The updating of variables (step 1) must 
be understood as computing x^ l \ etc. given 
z^^ 1 ^ and u^ l \ and assuming pW = p. 

• Except for the very first ADMM iteration (t = 1), 
the initial value for p is the previous selected value 
p(t— i), p or the g rs t iteration, we derive an initial 
value of p such that: 

pi 1 ) — argmin/ data (z (1) - it (1) /p) 
_ «W T - H T W H «W 

~ «(1)T. U (1) (46) 

with uW = V/data(^ (1) ) = H T W (H z« - y) 
and which amounts to have the best ar , defined in 
Eq. (18), with respect to /data- This choice has the 
advantage of avoiding an initialization with an ar- 
bitrary value for p. The rule does however not yield 
an efficient strategy for tuning p at every iteration. 

• Algorithm 2 is safeguarded in the sense that it 
maintains a strict bracketing p mm < p^ < p msx 
of the solution. 

• In step 2 of Algorithm 2: The value of p is accepted 
when 1/t < r/ < r which, with r > 1, is how 
we express that 77 « 1. We achieved good results 
with t = 1.2 in our tests. The current value of 
p is also accepted, if the relative reduction in the 
convergence criterion (jyf', defined in Eq. (41), is 
better than a with respect to the previous iteration. 
This shortcut helps to reduce the number of inner 
iterations. To avoid this shortcut, it is sufficient to 
take: a = 0. We took a = 0.9 in our tests. 

• In step 3 and step 4 of Algorithm 2: When p 
has been bracketed by (p m \n, Pmax), taking p = 
-y/Pmin Pmax, that is the geometrical means of the 
end points, is similar to a bisection step in a zero 
finding algorithm. 

• For the first ADMM iteration, the magnitude of p 
is not yet known so to avoid too many iterations, 
we use a larger value of the loop gain 7, say 7 = 10 
when t = 1 and 7 = 1.5 for t > 1. 

Another possibility is to always accept an ADMM it- 
eration and simply use the value of to determine 
whether p should be reduced, kept the same, or aug- 
mented for the next iteration. For instance: 

{7P ( * ) if r\ t > t, 
p«/ 7 if th < 1/r, (47) 
p^> else; 

with t > 1 and 7 > 1. In words, p is augmented (by 
multiplying it by a factor 7) whenever the relative size 



of the primal residuals is significantly larger than that 
of the dual residuals; while p is reduced (by dividing it 
by a factor 7) whenever the relative size of the primal 
residuals is significantly smaller than that of the dual 
residuals. This strategy is similar to the one described 
by Boyd et al. [29] except that our prescription properly 
scales with the magnitudes of the residuals and of the 
objective function of the problem so we expect a better 
behavior. 

E. Debiasing the solution 

One of the drawback of sparsity imposed by means of the 
l\ norm is that it yields a result which is biased toward 
zero [36]. This is shown by the recovered spectra in Fig. 2 
and by the brightness distributions of the pixels in Fig. 3. 
Since the sparsity constraint really improves the detec- 
tion of the sources, the resulting image can be used to 
decide where the sources are. By thresholding the gray 
image or the wavelength integrated multi-spectral image 
resulting from a reconstruction with a spatial sparsity 
constraint, we define a sparse spatial support S contain- 
ing all detected sources. Then, as proposed in [37], for 
the debiasing step, we minimize the likelihood function 
/data (25) with non- negative constraints only over x$ de- 
fined as the parameters x restricted to the support S 
while keeping all other parameters equal to zero. As the 
sub-matrix H5 containing the columns of H restricted 
by iS is well conditionned, no additionnal prior is needed 
to define the debiased solution. 

3. Results 

To check the proposed algorithm, we simulated a cluster 
of 50 stars with random positions and luminosities and 
with spectra randomly taken from the library compiled 
by Jacoby et al. [38]. The field of view is 128 x 128 pixels 
with 0.5 milliarcseconds/pixel and we took 100 spectral 
channels from A = 493 nm to A = 507 nm by steps of 
A A = 0.14 nm. To simulate the observations, we took 
100 random interfcrometric baselines with a maximum 
baseline of 180 m. We added Gaussian white noise to the 
complex visibilities with a level such that the maximum 
signal-to-noise ratio (SNR) is equal to 100. 

For the image reconstructions, we considered three dif- 
ferent cases: the reconstruction of a multi-spectral distri- 
bution with the regularization /sparse (x) in Eq. (9), or the 
regularization fj \nt(x ) in Eq. (10), and the reconstruction 
of a gray object g with the regularization /sparse (g)- In 
order to set the relative weight of the priors, we choose 
the value of the hyper-paramctcr p which yields an im- 
age which has the least mean square error with the true 
distribution. Once the values of p and p are chosen, 
the reconstruction of a 128 x 128 x 100 distribution from 
~ 2 x 10 4 measurements takes about 4 minutes on a 
GNU/Linux workstation with a quad-core processor at 
3 GHz and using a multi-threaded version of FFTW [39] 
to compute the discrete Fourier transforms. 

Figure 1 shows the integrated flux, i.e. for 
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Fig. 1. Integrated flux for the star cluster. From top 
to bottom, left to right: true object; reconstruction with 
fully separable sparsity prior; reconstruction with spatial 
sparsity and assuming a gray object; reconstruction with 
joint-sparsity prior. The spectra of the sources encircled 
by the boxes are shown in Fig. 2. Axes units are in 
milliarcseconds . 



the true distribution and for the reconstructed ones. In 
all cases, sparsity priors effectively yield a solution with 
point-like structures. However, when there is no trans- 
spectral constraints, only a few sources are correctly 
found and there are many more spurious sources. When 
using f] \nt(x) or assuming a gray object, the estimated 
integrated luminosity is much more consistent with that 
of the true object: all existing sources are found and 
the spurious sources are not only less numerous but also 
much fainter than the true ones. This is shown by the 
brightness distributions depicted by Fig. 3 and Fig. 4. 

Figure 2 shows the spectra of the two stars encircled by 
boxes in Fig. I, clearly the spectra recovered with /joint {x) 
(thin curves marked with open triangles) are of much 
higher quality than the spectra estimated when treating 
the spectral channels independently (thin curves marked 
with open boxes). Compared to the true spectra (thick 
lines) there is however a small but significant bias in the 
spectra obtained with /joint (x). This is not unexpected 
as the mixed norm implemented by /joint (x) results in an 
attenuation as shown by Eq. (All). 

As shown by Fig. 4, in the reconstructed gray image 
or in the image reconstructed with /joint (a?) , ah sources 
whose mean flux is greater than 1 are true positive detec- 
tions while all false positives have a smaller mean flux. 
We therefore select the sources with mean fluxes greater 
than this level to apply the debiasing method described 
in Sec. 2 E to effectively remove this bias as shown by the 




wavelength (nm) 

Fig. 2. Spectra of two selected sources. Each panel 
show the spectra of one of the sources encircled by the 
boxes in Fig. 1. Thick lines are for the true spectra and 
thin lines with markers for the restored spectra. The 
open squares indicate the reconstruction with fully sepa- 
rable sparsity prior f sp arse(x ); the open triangles indicate 
the reconstruction with joint sparsity prior /joint (x)] the 
filled triangles indicate the restored spectra after debias- 
ing (which are virtually indistinguishable from the true 
ones). 



thin curves with filled triangles in Fig. 2. 

In our reconstructions with the regularization /joint ( x ) , 
we found that <^W < icr 3 was a good threshold for the 
global convergence of the algorithm and we compared 
the different strategies proposed in Sec. 2 D 5 to set the 
augmented penalty parameter p. The evolution of the 
convergence criterion for some of these strategies is plot- 
ted in Figure 5. With a constant value for p, we ob- 
served that the rate of convergence is quite sensitive to 
the value of the augmented penalty parameter. Indeed 
with p = 3x 10 2 (which is the best value we found), the al- 
gorithm converged in 455 s, while it took 1 071 s and 847 s 
with p = 10 2 and p = 10 3 respectively. Although we did 
not try many different values for the parameters r and 7, 
we found that the automatic strategies for setting p with 
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Fig. 3. Histograms of the mean fluxes of the sources for 
the true object (in black), for the 3-D images restored 
with fully separable sparsity (in white) and with joint- 
sparsity (in dark gray) priors, and for the 2-D gray image 
restored with sparsity prior (in light gray). The vertical 
scale has been truncated to focus on the distributions of 
the brightest sources. 
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Fig. 4. Histograms of the mean flux of the true and 
false positive detection in the reconstructions under joint 
sparsity and gray sparsity priors. A positive detection is 
defined as a pixel with non-zero mean flux in the recon- 
struction. The vertical scale has been truncated to focus 
on the distributions of the true positive sources. 



Algorithm 2 and r]^ or according to Eq. (47) failed with 
their convergence criterion oscillating with r^*' m 10 -2 . 
In fact, in spite of the loss of time due to the number of 
retries needed to find a correct value for p at each ADMM 
iteration, we found that the best strategy was to use rj^l 
in Algorithm 2. In this case, it seemed to be better to 
use a tighter tolerance for 77^ rj 1 as with r = 3 the al- 
gorithm converged in 402 s, while it took only 250 s with 
r = 1.2 (sec Fig. 5). 



10- 



10- 4 - 



10- 




time (s) 



Fig. 5. Evolution of the convergence criterion <jft\ de- 
fined in Eq. (41), for different strategies to choose the 
augmented penalty parameter. Dashed curves are for a 
constant p. Solid curves are for p automatically set to 
have r$ 



1. 



4. Discussion and Perspectives 

We have shown the importance of using trans-spectral 
constraints to improve the quality of the restoration of 
the multi-spectral brightness distribution I\(9) of an 
astronomical target from optical interfcrometric data. 
These results confirm what has been observed for other 
types of data (like integral field spectroscopy). 

For the moment, our demonstration is restricted to 
specific objects which are spatially sparse {e.g. point-like 
sources) and must be generalized to other types of spatial 
distributions. Being implemented by non-differcntiable 
cost functions, spatial sparsity requires specific optimiza- 
tion algorithms. We have shown that variable splitting by 
the alternate direction method of multipliers (ADMM) 
is suitable to solve the optimization problem in a short 
amount of time. In addition to being able to deal with 
non-differentiable criteria, the ADMM method leads to 
splitting the full problem in sub-problems that are easier 
to solve and that may be independent. This straightfor- 
wardly gives the opportunity of speeding up the code, 
e.g. by means of parallclization. This possibility remains 
if other priors are used, e.g. to account for a smooth 
spatial distribution. 

To simplify the problem at hand, we considered that 
complex visibilities have been measured. At optical wave- 
lengths, this is only possible with phase referencing [40]. 
In order to process most existing interfcrometric data, 
we will have to modify the likelihood term /d a ta(^) and 
use a non-linear method {i.e. not the linear conjugate 
gradients) to update the auxiliary variables z. The new 
algorithm that we proposed, because it splits the two 
cost functions, jdata(z) and /priori), nray however be an 
efficient alternative to the variable metric method used 
in MiRA [1, 26] or the non-linear conjugate gradients 
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method in Bsmem [2, 3] which considers directly the sum 
of the cost functions. 



Imposing the non-negativity is straightforward and 
yields: 
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Appendix A: Proximity Operators for Spatial 
Sparsity of Non-Negative Variables 

Updating of the variables x by Eq. (17) and (18) in the 
ADMM method consists in solving a problem of the form: 



> a f( x ) + o W x ~ x \\l 



(Al) 



[■JV 



l+, we denote 



with a = n/p^ > and f(x) = f pr \or(x)- Solving prob- 
lem (Al) is very close to applying the so-called proximity 
operator (also known as Moreau proximal mapping) of 
the function af(x) which is defined by [28]: 

prox af (x) = argmin j a f(x) + - \\x - x\\ 2 2 \ . (A2) 

reel" I 1 J 

Proximity operators for non differentiable cost functions 
like /sparse^) or /joint (sc) have already been derived [28] 
and we simply need to modify them to account for the 
additional constraint that x € X. Since X is the subset 
of non-negative vectors of 
by: 

prox+ / (S) = argmin ja f(x) + i \\x-xf 2 \ 



the modified proximity operator that we use to update 
the variables x in our algorithm while accounting for non- 
negativity. 

1. Separable Sparsity 

The proximity operator for f S p 3rS s{x) defined in Eq. (9), 
that is the fj-norm of x, is the so called soft thresholding 
operator [28]: 

{x n ,t - a if x n j > a ; 
x n ,e + a if x n j < -a ; (A4) 
else. 



prox^ 



(x) r 



x n p - a if x n j > a : 







else. 



(A5) 



This shows that if a > m.ax n ,£%n,£i the output of the 
proximity operator is zero everywhere. 

2. Spatio- Spectral Regularization 

Using the joint-sparsity regularization /joint (x) given by 
Eq. (10), we aim at minimizing the criterion: 



1/2 1 



"2 W x ~ x h 



/joint \X ) 



which is strictly convex with respect to the variables x 
[15]. We note that c{x) is separable with respect to the 
pixel index n. Thus all computations can be done in- 
dependently for the spectral energy distribution of each 
pixel. 

Considering first the unconstrained case and for vari- 
ables x such that the function fj - mt (x) is differentiable, 
minimizing c{x) with respect to the variables x amounts 
to finding the root of the partial derivatives of c{x): 



dc{x) 
dx n .e 



= 



— X n ,t + (x n j - Xn,t) = 
Pn 



1 + a/t3 n 



with: 



fin 



1/2 



(A6) 



(A7) 



the Euclidean norm of the spectral energy distribution 
of the nth pixel of x. Assuming for the moment that 
fi n > 0, otherwise c(x) is not differentiable, combining 
Eq. (A6) and Eq. (A7) yields: 



fin = 



fin 



O-l fin 

since a//3„ > and with: 

fin = x n 

Solving Eq. (A8) for fi n yields: 

fin =fin~ 



1/2 



(A8) 



(A9) 



(A10) 



The non-differentiable case occurs when the above ex- 
pression yields a value of fi n which is not strictly posi- 
tive, that is when B n < a, in which case the minimum 
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of the cost function is given by x n ,. 
proximity operator of a/j int(f) is: 



0, W. Finally, the 



P rOX a/ Join »n, 




x n ,£ if fin > a ; 

else 

(All) 

where f3 n is the Euclidean norm of the spectral energy 
distribution of the nth pixel of x defined by Eq. (A9). 

In the differentiable case, requiring that x n> i > yields 
a simple modification of the unconstrained solution given 
by Eq. (A6): 



X n ,£ 



max(0, x n ,i) 
1 + a/Pn 



(A12) 



since 1 + a/(3 n > 0. The rest of the reasoning is sim- 
ilar than the unconstrained case except that f3 n has to 
be replaced by /3+ the Euclidean norm of the spectral 
energy distribution of the nth pixel of max(0,a;). The 
proximity operator of /joint {x) modified to account for 
non-negativity is finally: 



p rox i/ joiBt 



(x) = prox Q/ioint (max(0,a;)) 



(A13) 



Appendix B: Stopping criterion for the conjugate 
gradient method 

We derive here a possible strategy to set the stopping cri- 
terion for the conjugate gradient method used to update 
the auxiliary variables z so as to guarantee the global 
convergence of the ADMM method. If 2exLt is the solu- 
tion of the linear system A'*) • z = 6^, then: 

0; 

while, for the approximate solution: 

A« • - &<*> = „W , 

where r" are the so called residuals at the end of the 
conjugate gradients iterations. Then: 



A (t) . ,(*) _ h (t) 



*«-*«a=[AW]" 1 .« 



(t) 



As AW = H T W H + I with > and since 
H T - W • H is at least positive semi-definite, the smallest 
eigenvalue of A'*' is greater or equal pw, thus: 



To have: 



At) 



CO 

El 

t=i 



At) 



At) 



< \\v 



(01 



Jp {t) ■ 



z (t) II <oo 

''exact || 2 ^ 



a sufficient condition is therefore to make sure that: 



Y}\v®\\jpV <oo 



This can be achieved by imposing at each iteration that 
the stopping criterion for the conjugate gradients be such 
that: 

||« (t) || 2 <7CGP (t) ^ G (Bl) 
with 7cg > and £cg & (0,1) since then: 



El 

t=i 



At) _ „W 



" exact 1 1 2 



< 



EIK } ll> (t 



*=i 



<7CG 



7CG CCG 
1 - ' 

Note that the sum is finite if and only if |£cg| < 1- 
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